
For this assignment, the tools used to explore the data and extract meaningful patterns will be the ggplot2 and pandas libraries, for R and Python respectively. They will be used together within a Jupyter Notebook, which allows data to be passed easily between the separate interpreter kernels.
The ggplot2 package is a plotting library for R, created by Hadley Wickham. It sits alongside his other 'Tidyverse' packages to present a uniform syntax for manipulating, plotting and analysing data. The ggplot2 library itself is a programmatic implementation of Leland Wilkinson's 'Grammar of Graphics'. Wilkinson has also held senior positions at both Tableau and H2O.ai.
The pandas library is a package for Python which facilitates the efficient handling of large scale data, through the use of its 'DataFrame' object and many associated methods.
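As a minimal, illustrative sketch of the DataFrame workflow used throughout this document (the figures here are invented, not taken from the assignment data):

```python
import pandas as pd

# Hypothetical miniature of the lobbying data handled later in this document
df = pd.DataFrame({
    "Entity": ["Seattle", "Tacoma", "Spokane"],
    "Compensation": [1000.0, 400.0, 250.0],
    "Expenses": [200.0, 50.0, 25.0],
})

# Column arithmetic is vectorised across every row at once
df["Total"] = df["Compensation"] + df["Expenses"]

# Aggregation methods summarise a column in a single call
print(df["Total"].sum())  # total spending across all entities: 1925.0
```

The same vectorised-column and aggregation pattern recurs in each of the preparation steps below.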
Firstly, a few environmental preparations & loading required libraries:
%cd /mnt/Data/Jupyter/VisCA1
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
%load_ext rpy2.ipython
%%R
require(tidyverse)
require(ggplot2)
require(plotluck)
require(reshape2)
require(cowplot)
theme_set(theme_minimal())
First, the data must be read in.
lobbying = pd.read_excel("lobbying_data.xlsx")
lobbying.tail()
Next, the monetary columns are aggregated to provide an overview of lobbying spending (a simple and quick calculation to perform with pandas), unnecessary columns are dropped and the column headers tidied up.
lobbying["Total-Lobbying"] = lobbying["Compensation"] + lobbying["Expenses"]
lobbyingTree = lobbying[["entity_type", "Entity-cleaned", "Total-Lobbying"]]
lobbyingTree.rename(columns={"entity_type":"Type", "Entity-cleaned":"Entity", "Total-Lobbying":"Lobbying"}, inplace=True)
lobbyingTree.head()
The data is now passed to R and entered into a 'data.tree' structure, which is well suited to holding hierarchical, entity-related data of this kind.
%%R -i lobbyingTree
require(data.tree)
lobbyingTree$pathString <- paste('LOBBYING_TREE', lobbyingTree$Type, lobbyingTree$Entity, sep = "/")
tree <- as.Node(lobbyingTree[,])
print(tree, pruneMethod='dist', limit=20)
Next, the total lobbying values are entered and aggregated through the different levels of the tree.
%%R
tree$Do(function(x) x$aggLobbying <- Aggregate(x, 'Lobbying', sum))
tree$Sort(attribute='Lobbying', decreasing=TRUE, recursive=TRUE)
print(tree, 'Lobbying', 'aggLobbying', pruneMethod='dist', limit=14)
The data is now in the correct format to be plotted as a treemap. The 'treemap' library is used to create the plot, which is displayed below.
%%R -w 8 -h 5 --units in -r 150
require(treemap)
treemap(ToDataFrameNetwork(tree, 'Lobbying', direction='climb')[9:139,],
index=c('from', 'to'), vSize='Lobbying', vColor='Lobbying', type='value', palette=heat.colors(-55))
An alternative way to view this data (which conveys the same information) is as a graph. Below, the igraph library is used to present the data in a different visual format.
%%R -w 12 -h 12 --units in -r 100
require(igraph)
treeGraph <- as.igraph(tree, 'aggLobbying', directed=F, direction='climb')
vertex_attr(treeGraph, 'aggLobbying')[1] <- 0
plot(treeGraph, vertex.color='lightblue', vertex.label.family='Segoe UI',
vertex.label.cex=0.55, vertex.label.color='black', layout=layout.kamada.kawai,
vertex.size=vertex_attr(treeGraph, 'aggLobbying') / (max(vertex_attr(treeGraph, 'aggLobbying')) * 0.035))
These visualisations show that a large proportion of the lobbying money is coming from the cities entity-type. Further, it is clear that by far the largest monetary amount has come from Seattle. While this is interesting, it would be more revealing to look at how these payments were distributed geographically. The next question will explore this objective.
In order to explore this data geographically, it is necessary to introduce geographical (longitude/latitude) columns into the dataframe. In this case, this data will be joined onto the current dataset from a publicly available dataset containing American cities, states, counties, zipcodes and long/lat columns.
The .csv file is read into pandas and filtered down to remove all non-Washington addresses. Next, all rows which contain a duplicate of the 'city' field are removed (to provide a clean, singular location per city). The city is then set as the dataframe index in preparation for being joined onto the lobbying dataframe.
longlat = pd.read_csv("zip_codes_states.csv")
longlat = longlat[longlat["state"] == "WA"]
longlat.drop_duplicates(subset="city", inplace=True)
longlat.set_index("city", inplace=True)
longlat.head()
The dataframes are now joined and the column names cleaned up. In order to ensure the dataset is clean, a query is performed to return all rows containing NaN values.
lobbyingGeo = lobbying[lobbying["entity_type"] == "CITIES"].join(longlat, how="left", on="City")[
["Total-Lobbying", "City", "county", "longitude", "latitude"]]
lobbyingGeo.rename(columns={"Total-Lobbying":"lobbying", "City":"city"}, inplace=True)
lobbyingGeo.drop(lobbyingGeo.index[44], inplace=True) # duplicate row
lobbyingGeo[lobbyingGeo.isnull().any(axis=1)]
Since there are only a few rows which have been missed, the data for these location coordinates has been extracted from 'http://citylatitudelongitude.com/' and added into the dataframe manually.
lobbyingGeo.loc[0, "county"], lobbyingGeo.loc[0, "latitude"], lobbyingGeo.loc[0, "longitude"] = "King", 47.281954, -122.250388
lobbyingGeo.loc[6, "county"], lobbyingGeo.loc[6, "latitude"], lobbyingGeo.loc[6, "longitude"] = "King", 47.46917, -122.364291
lobbyingGeo.loc[8, "county"], lobbyingGeo.loc[8, "latitude"], lobbyingGeo.loc[8, "longitude"] = "King", 47.364791, -122.104563
lobbyingGeo.loc[12, "county"], lobbyingGeo.loc[12, "latitude"], lobbyingGeo.loc[12, "longitude"] = "Pierce", 47.23206, -122.351726
lobbyingGeo.loc[38, "county"], lobbyingGeo.loc[38, "latitude"], lobbyingGeo.loc[38, "longitude"] = "King", 47.44333, -122.298767
lobbyingGeo.loc[41, "county"], lobbyingGeo.loc[41, "latitude"], lobbyingGeo.loc[41, "longitude"] = "King", 47.756904, -122.342414
lobbyingGeo.head()
A further check is carried out to discover whether there are any duplicate rows.
duplicates = len(lobbyingGeo.drop_duplicates(subset=["longitude", "latitude"])) != len(lobbyingGeo)
print("Any Duplicate coordinates present?:", duplicates)
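The check above hinges on the semantics of pandas' drop_duplicates; a toy illustration with invented coordinates, assuming standard pandas behaviour:

```python
import pandas as pd

# Toy frame with one repeated coordinate pair (values are illustrative only)
coords = pd.DataFrame({
    "city": ["Auburn", "Federal Way", "Federal Way"],
    "longitude": [-122.25, -122.31, -122.31],
    "latitude": [47.28, 47.32, 47.32],
})

# drop_duplicates keeps the first of each (longitude, latitude) pair,
# so a length mismatch signals that duplicates were present
deduped = coords.drop_duplicates(subset=["longitude", "latitude"])
print(len(deduped) != len(coords))  # True: a duplicate exists

# duplicated(keep=False) marks every member of a duplicate group,
# which is a direct way to pull the offending rows out for inspection
print(coords[coords.duplicated(subset=["longitude", "latitude"], keep=False)])
```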
Since duplicate rows have been discovered, these rows have been queried (to return the correct index of these records) and then manually re-entered.
lobbyingGeo[lobbyingGeo.duplicated(subset=["longitude", "latitude"], keep=False)]
lobbyingGeo.loc[11, "latitude"], lobbyingGeo.loc[11, "longitude"] = 47.322323, -122.312622
lobbyingGeo.loc[39, "latitude"], lobbyingGeo.loc[39, "longitude"] = 47.608013, -122.335167
When re-running the duplicates check, the following result is confirmed.
print("Any Duplicate coordinates present?:", len(lobbyingGeo.drop_duplicates(subset=["longitude", "latitude"])) != len(lobbyingGeo))
The data is now in a suitable state for visualising. It is passed into R and plotted. Two visualisations are shown below. On the left is a density plot, showing a broad overview which highlights areas of the state of Washington that had a generally significant concentration of datapoints. This chart reveals that the vast majority of the payments are coming from the highly urbanised area around Seattle. On the right, the location and scale of the payments coming from the Seattle area are shown on an urban roadmap.
The maps are pulled from Google Maps and are then overlaid with visual information through the use of the ggplot2 library. The minimum and maximum longitude and latitude values are used to automatically calculate central points for the visualisations.
The 'plot_grid()' function of the 'cowplot' package is used to plot the visualisations side by side. This library is used throughout the rest of the document in order to arrange ggplot objects in multiple-plot formats.
%%R -i lobbyingGeo -w 7 -h 4 --units in -r 180
require(ggmap)
lat <- c(min(lobbyingGeo$latitude), max(lobbyingGeo$latitude))
lon <- c(min(lobbyingGeo$longitude), max(lobbyingGeo$longitude))
plot <- get_map(location=c(mean(lon), mean(lat)), zoom=7, source='google', maptype='terrain')
state <- ggmap(plot, extent='device') +
geom_density2d(data=lobbyingGeo, aes(x=longitude, y=latitude), size=0.1, color='black') +
stat_density2d(data=lobbyingGeo,
aes(x=longitude, y=latitude, fill=..level.., alpha=..level..), size=0.01,
bins=16, geom='polygon') +
scale_fill_gradient(low='green', high='red', guide=FALSE) +
scale_alpha(range=c(0, 0.3), guide=FALSE)
plot <- get_map(location=c(median(lobbyingGeo$longitude), median(lobbyingGeo$latitude)),
zoom=9, source='google', maptype='roadmap')
cities <- ggmap(plot, extent='device') +
geom_point(aes(x=longitude, y=latitude), data=lobbyingGeo, col="darkred", alpha=0.4,
size=lobbyingGeo$lobbying/max(lobbyingGeo$lobbying)*15) +
scale_alpha(range = c(0, 0.3), guide = FALSE)
plot_grid(state, cities, ncol=2, labels=NULL)
On visual observation, it is clear that the vast majority of lobbying money is coming from this highly urbanised area. The previous treemap and graph plot highlighted that a large sum of money was coming from Seattle itself, but that did not show the true scope of the concentration of payments.
This raises questions, such as whether the lobbying data is being documented accurately, and whether it is meaningful for these payments to be recorded as geographically distributed when they lie over such a small area. Further, this chart does not consider lobbying payments made by private organisations, businesses and societies. These organisations are also highly likely to be centred inside the city area, and in some cases are quite possibly connected to the politicians and individuals associated with the payments made through public bodies.
The dataset chosen for this exercise is the 'Refinery Accidents in the United States'.
Firstly, the data is read into pandas. The final row has been omitted, as it was a 'Totals' record; if left in, it would have heavily skewed the proportionate distribution of the data in comparison with the other records.
refinery = pd.read_excel("19. Data Set - Module 6 - accidents-state.xlsx")[:-1]
refinery.tail()
To map this data, 'shape file' data is needed. The 'maps' package in R contains many useful datasets of this type. The shape data for states is loaded into the R kernel and exported for use by pandas in Python.
%%R -o all_states
require(maps)
all_states <- map_data("state")
head(all_states)
Next, the two datasets are to be joined. Before joining, an exploratory query is performed in order to assess whether the data is complete enough to be successfully combined. To allow the join to accurately match the state names, the names in the original dataset are first converted to lower case. A right join is then performed (so that no rows from the original dataset are lost) and a query is performed to display any NaN values present.
refinery["State"] = refinery["State"].str.lower()
nulls = all_states.join(refinery.set_index("State"), how="right", on="region")
nulls[nulls.isnull().any(axis=1)]
All the mainland states were successfully joined, with any remaining NaN values being the result of non-mainland states and territories. Most of these do not show particularly interesting data; however, the row for the Virgin Islands is quite astonishing. While there is only one facility, there have been numerous accidents and a large amount of collateral damage. We are primarily interested in the mainland states, however, so these rows will be ignored.
The two datasets are now joined, and the columns renamed with more tidy (and functionally useful) titles.
refinery.rename(columns={"# of RMP facilities":"facility_count", "# of accidents":"accidents",
"# of deaths":"deaths", "# of injuries":"injuries",
"# evacuated":"evacuated", "Property damage (dollars)":"property_damage"}, inplace=True)
refineryGeo = all_states.join(refinery.set_index("State"), how="right", on="region")
refineryGeo.drop(columns="subregion", inplace=True)
refineryGeo.head()
Next, the data in the newly joined frame is plotted. The shape file can be filled in with colour gradients, in order to visualise the scale of each column of interest across America. This time, a black and white map is loaded from the 'Stamen' maps API (through ggmap) and the shapefile plotted over it. The plot_grid() function is used to combine multiple plots together. Colours have been assigned to differentiate between general metrics, collateral damage and human cost.
%%R -i refineryGeo -w 7 -h 3 --units in -r 180
lat <- c(min(all_states$lat), max(all_states$lat))
lon <- c(min(all_states$lon), max(all_states$lon))
plot <- get_map(location=c(mean(lon), mean(lat)), zoom=4, source='stamen', maptype='toner')
facility_count <- ggmap(plot, extent='device') +
scale_x_continuous(limits=c(min(lon), max(lon)), expand = c(0, 0)) +
scale_y_continuous(limits=c(min(lat), max(lat)), expand = c(0, 0)) +
geom_polygon(data=refineryGeo, aes(x=long, y=lat, group=group, fill=facility_count), colour='black', size=0.1, alpha=0.4) +
scale_fill_continuous(low='thistle2', high='darkblue', guide=FALSE)
damage <- ggmap(plot, extent='device') +
scale_x_continuous(limits=c(min(lon), max(lon)), expand = c(0, 0)) +
scale_y_continuous(limits=c(min(lat), max(lat)), expand = c(0, 0)) +
geom_polygon(data=refineryGeo, aes(x=long, y=lat, group=group, fill=property_damage), colour='black', size=0.1, alpha=0.4) +
scale_fill_continuous(low='thistle2', high='chocolate', guide=FALSE)
injuries <- ggmap(plot, extent='device') +
scale_x_continuous(limits=c(min(lon), max(lon)), expand = c(0, 0)) +
scale_y_continuous(limits=c(min(lat), max(lat)), expand = c(0, 0)) +
geom_polygon(data=refineryGeo, aes(x=long, y=lat, group=group, fill=injuries), colour='black', size=0.1, alpha=0.4) +
scale_fill_continuous(low='thistle2', high='darkred', guide=FALSE)
accidents <- ggmap(plot, extent='device') +
scale_x_continuous(limits=c(min(lon), max(lon)), expand = c(0, 0)) +
scale_y_continuous(limits=c(min(lat), max(lat)), expand = c(0, 0)) +
geom_polygon(data=refineryGeo, aes(x=long, y=lat, group=group, fill=accidents), colour='black', size=0.1, alpha=0.4) +
scale_fill_continuous(low='thistle2', high='darkblue', guide=FALSE)
evacuated <- ggmap(plot, extent='device') +
scale_x_continuous(limits=c(min(lon), max(lon)), expand = c(0, 0)) +
scale_y_continuous(limits=c(min(lat), max(lat)), expand = c(0, 0)) +
geom_polygon(data=refineryGeo, aes(x=long, y=lat, group=group, fill=evacuated), colour='black', size=0.1, alpha=0.4) +
scale_fill_continuous(low='thistle2', high='chocolate', guide=FALSE)
deaths <- ggmap(plot, extent='device') +
scale_x_continuous(limits=c(min(lon), max(lon)), expand = c(0, 0)) +
scale_y_continuous(limits=c(min(lat), max(lat)), expand = c(0, 0)) +
geom_polygon(data=refineryGeo, aes(x=long, y=lat, group=group, fill=deaths), colour='black', size=0.1, alpha=0.4) +
scale_fill_continuous(low='thistle2', high='darkred', guide=FALSE)
plot_grid(facility_count, damage, deaths, accidents, evacuated, injuries, nrow=2, ncol=3,
labels=c('Facilities', 'Damage', 'Deaths', 'Accidents', 'Evacuated', 'Injuries'))
While this display is visually appealing and seems informative, it is flawed. Each state has a differing number of facilities, meaning that the larger states and those with the highest counts of facilities will obviously show a much higher value across all categories related to fallout from an accident. This is the reason that almost every metric simply displays Texas as the state with the highest values.
In order to correct for this, each fallout-related column is divided by the total number of facilities in that state. This gives a measure of accidents per facility and should provide a much more level and useful metric of the comparative safety levels in each state. The adjusted columns are added to the dataframe, and a 'MinMaxScaler' is imported from the scikit-learn machine learning library and applied in order to scale all values to between 0 and 1.
refinery = refinery.set_index("State").drop(index=[region for region in nulls[nulls.isnull().any(axis=1)].region]).reset_index()
refinery["scaled_accidents"] = refinery["accidents"] / refinery["facility_count"]
refinery["scaled_damage"] = refinery["property_damage"] / refinery["facility_count"]
refinery["scaled_evacuated"] = refinery["evacuated"] / refinery["facility_count"]
refinery["scaled_deaths"] = refinery["deaths"] / refinery["facility_count"]
refinery["scaled_injuries"] = refinery["injuries"] / refinery["facility_count"]
refinery
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
refinery_scaled = pd.concat([refinery.iloc[:,[0, 1, 2, 3, 4, 5, 6]],
pd.DataFrame(scaler.fit_transform(refinery.loc[:,[
"scaled_accidents", "scaled_damage", "scaled_evacuated", "scaled_deaths", "scaled_injuries"]]),
columns=["scaled_accidents", "scaled_damage", "scaled_evacuated", "scaled_deaths", "scaled_injuries"])], axis=1)
refineryGeo_scaled = all_states.join(refinery_scaled.set_index("State"), how="inner", on="region")
refineryGeo_scaled.drop(columns="subregion", inplace=True)
refineryGeo_scaled.head()
With the adjusted and scaled dataset prepared, the data can be plotted again.
%%R -i refineryGeo_scaled -w 7 -h 3 --units in -r 180
lat <- c(min(all_states$lat), max(all_states$lat))
lon <- c(min(all_states$lon), max(all_states$lon))
plot <- get_map(location=c(mean(lon), mean(lat)), zoom=4, source='stamen', maptype='toner')
facility_count <- ggmap(plot, extent='device') +
scale_x_continuous(limits=c(min(lon), max(lon)), expand = c(0, 0)) +
scale_y_continuous(limits=c(min(lat), max(lat)), expand = c(0, 0)) +
geom_polygon(data=refineryGeo_scaled, aes(x=long, y=lat, group=group, fill=facility_count), colour='black', size=0.1, alpha=0.4) +
scale_fill_continuous(low='thistle2', high='darkblue', guide=FALSE)
damage <- ggmap(plot, extent='device') +
scale_x_continuous(limits=c(min(lon), max(lon)), expand = c(0, 0)) +
scale_y_continuous(limits=c(min(lat), max(lat)), expand = c(0, 0)) +
geom_polygon(data=refineryGeo_scaled, aes(x=long, y=lat, group=group, fill=scaled_damage), colour='black', size=0.1, alpha=0.4) +
lims(fill=c(0, 1)) +
scale_fill_continuous(low='thistle2', high='chocolate') +
theme(legend.position='none')
injuries <- ggmap(plot, extent='device') +
scale_x_continuous(limits=c(min(lon), max(lon)), expand = c(0, 0)) +
scale_y_continuous(limits=c(min(lat), max(lat)), expand = c(0, 0)) +
geom_polygon(data=refineryGeo_scaled, aes(x=long, y=lat, group=group, fill=scaled_injuries), colour='black', size=0.1, alpha=0.4) +
lims(fill=c(0, 1)) +
scale_fill_continuous(low='thistle2', high='darkred') +
theme(legend.position='none')
accidents <- ggmap(plot, extent='device') +
scale_x_continuous(limits=c(min(lon), max(lon)), expand = c(0, 0)) +
scale_y_continuous(limits=c(min(lat), max(lat)), expand = c(0, 0)) +
geom_polygon(data=refineryGeo_scaled, aes(x=long, y=lat, group=group, fill=scaled_accidents), colour='black', size=0.1, alpha=0.4) +
lims(fill=c(0, 1)) +
scale_fill_continuous(low='thistle2', high='darkblue') +
theme(legend.position='none')
evacuated <- ggmap(plot, extent='device') +
scale_x_continuous(limits=c(min(lon), max(lon)), expand = c(0, 0)) +
scale_y_continuous(limits=c(min(lat), max(lat)), expand = c(0, 0)) +
geom_polygon(data=refineryGeo_scaled, aes(x=long, y=lat, group=group, fill=scaled_evacuated), colour='black', size=0.1, alpha=0.4) +
lims(fill=c(0, 1)) +
scale_fill_continuous(low='thistle2', high='chocolate') +
theme(legend.position='none')
deaths <- ggmap(plot, extent='device') +
scale_x_continuous(limits=c(min(lon), max(lon)), expand = c(0, 0)) +
scale_y_continuous(limits=c(min(lat), max(lat)), expand = c(0, 0)) +
geom_polygon(data=refineryGeo_scaled, aes(x=long, y=lat, group=group, fill=scaled_deaths), colour='black', size=0.1, alpha=0.4) +
lims(fill=c(0, 1)) +
scale_fill_continuous(low='thistle2', high='darkred') +
theme(legend.position='none')
plot_grid(facility_count, damage, deaths, accidents, evacuated, injuries, nrow=2, ncol=3,
labels=c('Facilities', 'Damage', 'Deaths', 'Accidents', 'Evacuated', 'Injuries'))
This time, a much more informative result is delivered. While the 'Facilities' plot is identical, merely representing the number of facilities in each state directly, the other visualisations are all scaled according to the number of facilities in that state. The 'Accidents' overview shows a relatively even spread of accidents when weighted against the number of facilities. We can therefore conclude that the nature of the collateral damage and human cost can be attributed to the nature of the accidents and the response to them, rather than to the probability of accidents occurring. For example, while California shows a high proportion of injuries and evacuations, it has very few deaths. Conversely, while Washington shows low levels of injury, damage and evacuation, its death proportion is high. There are also anomalous exceptions such as Wyoming, which shows an especially high damage level. Since no state shows a markedly higher rate of accidents per facility, these differences could be due to the nature of the refineries themselves and their safety procedures, or to how the emergency responses were handled in cases where accidents did occur.
Next, this information will be presented in a form which can be assimilated more easily, quickly and intuitively. It was mentioned before that the columns can be divided between collateral damage and human cost. The relevant columns will be aggregated and presented as those two categories of interest.
Additionally, a larger map will be presented to display a generalised overall 'risk factor' associated with refineries in that state. To facilitate this, the dimensionality of the data is reduced down using the PCA processing module from the SciKit-Learn library. Before applying the PCA transformation, the data is normalised to a standard normal distribution using the 'StandardScaler' module, also from SciKit-Learn. Once aggregated, the data is plotted on a map.
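As a toy sketch of the StandardScaler → PCA pipeline just described (synthetic data, not the refinery dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy stand-in for the five scaled accident columns: 40 rows of
# five strongly correlated features built from one shared factor
base = rng.normal(size=(40, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(40, 1)) for _ in range(5)])

# Standardise each column to mean 0, variance 1 before PCA,
# so no single column dominates the components
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=3)
components = pca.fit_transform(X_std)

# With strongly correlated inputs, the retained components
# should capture almost all of the variance
print(components.shape)                     # (40, 3)
print(pca.explained_variance_ratio_.sum())  # close to 1.0
```

Summing the retained components per row, as done below for the 'risk factor', collapses them into a single scalar per state.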
Finally, because it has now been validated that refineries suffer accidents generally evenly, the two refinery metric charts are of less interest and are omitted.
from sklearn.preprocessing import StandardScaler
x = refinery_scaled.loc[:, ["scaled_accidents", "scaled_damage", "scaled_evacuated", "scaled_deaths", "scaled_injuries"]].values
y = refinery_scaled.loc[:,["State"]].values
x = StandardScaler().fit_transform(x)
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(principalComponents)
principalDf["Overall"] = principalDf.sum(axis=1)
refinery_PCA = pd.concat([refinery_scaled[["State"]], principalDf["Overall"]], axis=1)
refineryGeo_PCA = all_states.join(refinery_PCA.set_index("State"), how="inner", on="region")
refineryGeo_PCA.drop(columns="subregion", inplace=True)
refineryGeo_PCA.head()
The plot is created using the familiar methods from other recent map plots. It is displayed below.
%%R -i refineryGeo_PCA -w 8 -h 3 --units in -r 150
lat <- c(min(all_states$lat), max(all_states$lat))
lon <- c(min(all_states$lon), max(all_states$lon))
plot <- get_map(location=c(mean(lon), mean(lat)), zoom=4, source='stamen', maptype='toner')
collateral <- ggmap(plot, extent='device') +
scale_x_continuous(limits=c(min(lon), max(lon)), expand = c(0, 0)) +
scale_y_continuous(limits=c(min(lat), max(lat)), expand = c(0, 0)) +
geom_polygon(data=refineryGeo_scaled, aes(x=long, y=lat, group=group, fill=(scaled_damage+scaled_evacuated)/2), colour='black', size=0.1, alpha=0.4) +
lims(fill=c(0, 1)) +
scale_fill_continuous(low='thistle2', high='chocolate') +
theme(plot.title=element_text(size=8), legend.position='none') +
ggtitle('Collateral Damage')
human_cost <- ggmap(plot, extent='device') +
scale_x_continuous(limits=c(min(lon), max(lon)), expand = c(0, 0)) +
scale_y_continuous(limits=c(min(lat), max(lat)), expand = c(0, 0)) +
geom_polygon(data=refineryGeo_scaled, aes(x=long, y=lat, group=group, fill=(scaled_injuries+scaled_deaths)/2), colour='black', size=0.1, alpha=0.4) +
lims(fill=c(0, 1)) +
scale_fill_continuous(low='thistle2', high='darkred') +
theme(plot.title=element_text(size=8), legend.position='none') +
ggtitle('Human Cost')
plot <- get_map(location=c(mean(lon), mean(lat)), zoom=4, source='google', maptype='terrain')
overall_risk <- ggmap(plot, extent='device') +
scale_x_continuous(limits=c(min(lon), max(lon)), expand=c(0, 0)) +
scale_y_continuous(limits=c(min(lat), max(lat)), expand=c(0, 0)) +
geom_polygon(data=refineryGeo_PCA, aes(x=long, y=lat, group=group, fill=Overall), colour='black', size=0.0, alpha=0.4) +
lims(fill=c(min(refineryGeo_PCA$Overall), max(refineryGeo_PCA$Overall))) +
scale_fill_continuous(low='palegreen3', high='darkred') +
theme(plot.title=element_text(size=8), legend.position='none') +
ggtitle('Generalised Risk Level') +
scale_alpha(range=c(0, 0.3), guide=FALSE)
ggdraw() +
draw_plot(collateral, width=0.45, height=0.45, x=0, y=0.5) +
draw_plot(human_cost, width=0.45, height=0.45, x=0, y=0) +
draw_plot(overall_risk, width=0.95, height=0.95, x=0.2, y=0)
The dataset that is to be used here is very messy, so requires some initial work to bring into a usable form. Firstly, it is read into pandas. Because it is part of a multi-sheet excel file, the sheet number has also been specified.
bridgeData = pd.read_excel("Data Set - Module 7 - wa-15.xlsx", sheet_name=0)
bridgeData.tail()
A data-dictionary is also supplied with this file. While the columns of the primary DataFrame do not directly correlate with the rows in the dictionary, it does provide valuable insight into the data available and how to access and interpret it. It is read in in the same manner as the primary DataFrame and will be referred to on an ad-hoc basis when additional information on the data is required.
bridgeDict = pd.read_excel("Data Module 7 record layout - reclay_1_1.xls", sheet_name=0)
bridgeDict[90:].head()
This information is used to explore and query the primary DataFrame as follows. The column name or number is not directly related to the primary data, but it is close enough to quickly locate the desired data.
bridgeData.iloc[:, [29, 33]].head()
The second sheet in the primary spreadsheet contains a small dataset with rows connecting the county names with the county fips codes. This will be necessary, so it is read in and assigned to a variable for future use.
bridgeCounties = pd.read_excel("Data Set - Module 7 - wa-15.xlsx", sheet_name=1)
bridgeCounties["COUNTY"] = bridgeCounties["COUNTY"].str.lower()
bridgeCounties.head()
Another shapefile will be needed in order to map and divide the counties of Washington. This will again be imported from the R 'maps' library. The county level data is imported and filtered down to only Washington counties, then passed out to the Python kernel in preparation for future use.
%%R -o wa_county
counties <- map_data("county")
wa_county <- subset(counties, region == 'washington')
tail(wa_county)
Now that all the required data is loaded, the dataset that will be used for analysis can be constructed. A new DataFrame is instantiated and the primary columns of interest are assigned and renamed to more useful headers.
bridges = pd.DataFrame()
bridges["lon"] = bridgeData["longmap"]
bridges["lat"] = bridgeData["latmap"]
bridges["county"] = bridgeData["cnty"]
bridges["status"] = bridgeData["stat2"]
bridges["rating"] = bridgeData["suffrtno"]
bridges.head()
Next, the table is filtered down to display only bridges marked as structurally deficient (holding a value of 1 or 2 in the column derived from the original 'stat2') which also have a rating of 50 or less (the threshold at which they are considered in urgent need of replacement or repair).
bridgesRep = bridges[((bridges["status"] == 2) | (bridges["status"] == 1)) & (bridges["rating"] <= 50)]
bridgesGeo = bridgeCounties[["COUNTY", "CNTYFIPS"]].join(bridgesRep.set_index("county"), how="inner", on="CNTYFIPS").reset_index(drop=True)
bridgesGeo.rename(columns={"COUNTY":"county"}, inplace=True)
bridgesGeo.drop(columns=["CNTYFIPS", "status"], inplace=True)
bridgesGeo.tail()
Viewing the tail of the dataset reveals that there are 389 bridges in this 'high risk' category. Initially, a density plot will be created as a broad overview of the distribution of deficient bridges.
%%R -i wa_county -i bridgesGeo -w 5 -h 3 --units in -r 200
lat <- c(min(wa_county$lat), max(wa_county$lat))
lon <- c(min(wa_county$lon), max(wa_county$lon))
plot <- get_map(location=c(mean(lon), mean(lat)), zoom=6, source='google', maptype='hybrid')
ggmap(plot, extent='device') +
scale_x_continuous(limits=c(min(lon), max(lon)), expand=c(0, 0)) +
scale_y_continuous(limits=c(min(lat), max(lat)), expand=c(0, 0)) +
scale_alpha(range=c(0, 0.3), guide = FALSE) +
theme_void() +
labs(title='Concentration of Deficient Bridges') +
geom_density2d(data=bridgesGeo, aes(x=lon, y=lat), size=0.1, color='white') +
stat_density2d(data=bridgesGeo,
aes(x=lon, y=lat, fill=..level.., alpha=..level..), size=0.01,
bins=16, geom='polygon') +
scale_fill_gradient(low='green', high='red', guide=FALSE)
It can be observed that the unsafe bridges are distributed widely over the entire state, with emphasis on the major population zones. There is a particular concentration around the Seattle area. This could be logical due to the higher concentration of traffic; however, it would be expected that bridges under higher strain would be more frequently and vigilantly maintained. This line of inquiry will be explored further.
In the following plot, much more detail will be included than in this broad overview visualisation. In order to facilitate this, rather than plotting on top of a map image, the plot will be drawn from the basic shape file and built up manually. This will provide a much cleaner canvas to work with, and more control over what elements are included. This will allow more aspects of the data to be visualised without the image becoming over-complicated.
So far, the Google and Stamen maps APIs have been relied upon to provide contextual geographic information. Plotting without the underlying map requires this information to be added manually, which can be achieved through ggplot2's 'geom_text()' function. In order to plot the names, geographically centred coordinates for each county are required. These are acquired by filtering the previously used US zip-code data down to Washington, grouping it by county, and calculating the median of each aggregation. This provides a roughly central measure that should allow the names to be plotted correctly.
wa_counties = pd.read_csv("zip_codes_states.csv")
wa_counties = wa_counties[wa_counties["state"] == "WA"]
wa_counties = wa_counties.groupby(by="county").median().reset_index()
wa_counties.drop(columns=["zip_code"], inplace=True)
wa_counties.head()
Finally, the average level of traffic over each bridge per county is calculated.
bridges["traffic"] = pd.to_numeric(bridgeData["avdayno"], errors="coerce")
bridgesTraffic = bridges[((bridges["status"] == 2) | (bridges["status"] == 1)) & (bridges["rating"] <= 50)]
bridgesGeoTraffic = bridgeCounties[["COUNTY", "CNTYFIPS"]].join(bridgesTraffic.set_index("county"), how="inner", on="CNTYFIPS").reset_index(drop=True)
bridgesGeoTraffic.rename(columns={"COUNTY":"county"}, inplace=True)
bridgesGeoTraffic.drop(columns=["CNTYFIPS", "status"], inplace=True)
trafficMap = wa_county.join(bridgesGeoTraffic[["county", "traffic"]].groupby(by="county").mean(), on="subregion")
trafficMap.head()
The visualisation is now plotted. All of the methodology used here has appeared in previous plots; the difference is that it is carefully combined to show a number of metrics at once. Bridges meeting the high risk criteria are represented by shaded points: the deeper the colour, the more severe the bridge's rating. The shading of each county represents the average level of daily traffic over high risk bridges in that county. The density lines from the previous plot are also kept, in order to accentuate the overall distribution of high risk bridges.
%%R -i wa_county -i wa_counties -i bridgesGeo -i trafficMap -w 6 -h 4 --units in -r 200
lat <- c(min(wa_county$lat), max(wa_county$lat))
lon <- c(min(wa_county$long), max(wa_county$long))
ggplot() +
    geom_polygon(data=trafficMap, aes(x=long, y=lat, group=group, fill=traffic), color='black', alpha=0.8, size=0.2) +
    scale_x_continuous(limits=lon, expand=c(0, 0)) +
    scale_y_continuous(limits=lat, expand=c(0, 0)) +
    geom_point(data=bridgesGeo, aes(x=lon, y=lat, color=rating), alpha=0.5, size=0.8) +
    scale_colour_gradient(low='thistle2', high='darkred', guide=FALSE) +
    scale_alpha(range=c(0, 0.3), guide=FALSE) +
    theme_void() +
    labs(title='Severity Rating of Deficient Bridges') +
    geom_text(data=wa_counties, aes(x=longitude, y=latitude, label=county), color="black", size=2) +
    geom_density2d(data=bridgesGeo, aes(x=lon, y=lat), size=0.08, alpha=0.6, color='white') +
    scale_fill_continuous(low='forestgreen', high='darkred')
The plot clearly displays high risk bridges distributed across the entire state. As noted from the previous visualisation, the largest concentration of high risk bridges is centred around the highly urbanised Seattle area. The county shading also shows (besides one anomalous result in Asotin, which may see heavier cross-state traffic due to its location) that the Seattle area has much higher average traffic crossing those bridges than other counties. This could go some way towards explaining the higher concentration of high risk bridges in this area.
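The per-county comparison read off the map shading is simply a groupby-mean over daily traffic, which can also be ranked numerically. A minimal sketch with invented figures (not values from the dataset) shows the pattern that was applied to `bridgesGeoTraffic` above:

```python
import pandas as pd

# Hypothetical daily traffic counts over high risk bridges, for illustration.
highRisk = pd.DataFrame({
    "county": ["king", "king", "asotin", "skagit", "spokane"],
    "traffic": [85000, 62000, 70000, 55000, 12000],
})

# Mean daily traffic per county, ranked from heaviest to lightest.
ranked = (highRisk.groupby("county")["traffic"]
                  .mean()
                  .sort_values(ascending=False))
print(ranked)  # king tops the ranking with a mean of 73500
```

Applied to the real data, a ranking like this makes outliers such as Asotin easy to spot without re-reading the map.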
There is also a notably higher level of traffic crossing dangerous bridges in Skagit county. While it is close to the Seattle area, it shows a higher level of traffic than would be expected in comparison to other similarly located counties. A highly publicised bridge collapse occurred in Skagit in 2013. This bridge can be isolated and explored in the data.
It is known that this bridge was built in 1955. All of the Skagit bridges built in 1955 which classify as high risk bridges under our definition can be queried using this information. The date of construction is added to the prepared dataset, along with a few other columns of interest.
bridges["built"] = bridgeData["year"]
bridges["inspmnth"] = bridgeData["inspmnth"]
bridges["inspyear"] = bridgeData["inspyear"]
bridges["inspfreq"] = bridgeData["inspfreq"]
bridgesSkagit1955 = bridgeCounties[["COUNTY", "CNTYFIPS"]].join(bridges.set_index("county"), how="inner", on="CNTYFIPS").reset_index(drop=True)
bridgesSkagit1955.rename(columns={"COUNTY":"county"}, inplace=True)
bridgesSkagit1955.drop(columns=["CNTYFIPS"], inplace=True)
bridgesSkagit1955[((bridgesSkagit1955["status"] == 2) | (bridgesSkagit1955["status"] == 1)) & (bridgesSkagit1955["rating"] <= 50)
& (bridgesSkagit1955["built"] == 1955) & (bridgesSkagit1955["county"] == "skagit")]
This is the bridge which collapsed. The query returns a single bridge, which fits the description precisely. It can be verified that this is the correct bridge by comparison with the information on it here: https://bridgehunter.com/wa/skagit/4794A0000000/.
It is observable that this bridge has a very high level of daily traffic compared to other high risk bridges.
print("Mean daily traffic of all high risk bridges:", bridges[((bridges["status"] == 2) | (bridges["status"] == 1)) & (bridges["rating"] <= 50)]["traffic"].mean())
It can also be observed that this bridge is scheduled to be inspected every 24 months and that the last recorded inspection was in June 2015 (following the repair of the damage suffered in the bridge's collapse).
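Given the inspection columns added earlier (`inspmnth`, `inspyear`, `inspfreq`, with the frequency in months), the next scheduled inspection can be projected with simple month arithmetic. A sketch using the values reported for this bridge (June 2015, every 24 months):

```python
from datetime import date

# Values for the Skagit bridge: last inspected June 2015, every 24 months.
inspmnth, inspyear, inspfreq = 6, 2015, 24

# Convert to a running month count, add the frequency, convert back.
total_months = inspyear * 12 + (inspmnth - 1) + inspfreq
next_due = date(total_months // 12, total_months % 12 + 1, 1)
print(next_due)  # first day of the month the next inspection falls due: 2017-06-01
```

The same arithmetic could be vectorised across the whole `bridges` frame to flag high risk bridges whose inspections are overdue.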